{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Text Processing Exercise\n",
    "\n",
    "In this exercise, you will learn some building blocks for text processing. You will learn how to normalize, tokenize, stem, and lemmatize tweets from Twitter."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Fetch Data from the Online Resource\n",
    "\n",
    "First, we will use the `get_tweets()` function from the `exercise_helper` module to get all the tweets from the Twitter page https://twitter.com/AIForTrading1, which belongs to an account created especially for this course. The page contains 28 tweets, and our goal is to retrieve all 28 of them. The `get_tweets()` function uses the `requests` library and BeautifulSoup to get the tweets from the page. In a later lesson, we will learn how to use the `requests` library and BeautifulSoup to get data from websites. For now, we will just use this function to fetch the tweets we want."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['The Long-Term Stock Exchange Is Worth a Shot', 'Predicting Stock Performance with Natural Language Deep Learning', 'Comcast Acquiring Time Warner Cable In All Stock Deal Worth $45.2 Billion', 'Facebook stock drops more than 20% after revenue forecast misses', 'Facebook Buying WhatsApp for $16B in Cash and Stock Plus $3B in RSUs', 'Netflix’s ‘death cross’ is the third for FAANG stocks and Nasdaq Composite is next', 'After Yesterday’s Signs of Recovery, Crypto Markets See Drastic Losses', 'MF Sees Australia Risks Tilt to Downside on China, Trade War', 'Bitcoin Cash Clash Is Costing Billions With No End in Sight', 'SEC Crypto Settlements Spur Expectations of Wider ICO Crackdown', 'Nissan’s Drama Looks a Lot Like a Palace Coup', 'Yahoo Finance has apparently killed its API', 'Tesla Tanks After Goldman Downgrades to Sell', 'Goldman Sachs to Open a Bitcoin Trading Operation', 'Tax-Free Bitcoin-To-Ether Trading in US to End Under GOP Plan', 'Goldman Sachs Is Setting Up a Cryptocurrency Trading Desk', 'Robinhood stock trading app confirms $110M raise at $1.3B valuation', 'How I made $500k with machine learning and high frequency trading', \"Tesla's Finance Team Is Losing Another Top Executive\", 'Finance sites erroneously show Amazon, Apple, other stocks crashing', 'Jeff Bezos Says He Is Selling $1 Billion a Year in Amazon Stock to Finance Race to Space', 'US government commits to publish publicly financed software under Free Software licenses', 'The dream life is having your luggage first out of the carousel each time.', 'Stocks Sink as Apple, Facebook Pace the Tech Wreck: Markets Wrap', \"Elon Musk's SpaceX Cuts Loan Deal by $500 Million\", 'Nvidia Stock Falls Another 12%', 'Anything is possible in this world! Exhibit A: Creation of a sequel to Superbabies.', 'Elon Musk forced to step down as chairman, TSLA short all the way!']\n"
     ]
    }
   ],
   "source": [
    "import exercise_helper\n",
    "\n",
    "all_tweets = exercise_helper.get_tweets()\n",
    "\n",
    "print(all_tweets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Normalization\n",
    "Text normalization is the process of transforming text into a single canonical form.\n",
    "\n",
    "There are many normalization techniques; in this exercise we focus on two. First, we'll convert the text to lowercase; second, we'll remove all punctuation characters from the text.\n",
    "\n",
    "#### TODO: Part 1\n",
    "\n",
    "Convert text to lowercase.\n",
    "\n",
    "Use Python's built-in string method `.lower()` to convert each tweet in `all_tweets` to lowercase."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['the long-term stock exchange is worth a shot', 'predicting stock performance with natural language deep learning', 'comcast acquiring time warner cable in all stock deal worth $45.2 billion', 'facebook stock drops more than 20% after revenue forecast misses', 'facebook buying whatsapp for $16b in cash and stock plus $3b in rsus', 'netflix’s ‘death cross’ is the third for faang stocks and nasdaq composite is next', 'after yesterday’s signs of recovery, crypto markets see drastic losses', 'mf sees australia risks tilt to downside on china, trade war', 'bitcoin cash clash is costing billions with no end in sight', 'sec crypto settlements spur expectations of wider ico crackdown', 'nissan’s drama looks a lot like a palace coup', 'yahoo finance has apparently killed its api', 'tesla tanks after goldman downgrades to sell', 'goldman sachs to open a bitcoin trading operation', 'tax-free bitcoin-to-ether trading in us to end under gop plan', 'goldman sachs is setting up a cryptocurrency trading desk', 'robinhood stock trading app confirms $110m raise at $1.3b valuation', 'how i made $500k with machine learning and high frequency trading', \"tesla's finance team is losing another top executive\", 'finance sites erroneously show amazon, apple, other stocks crashing', 'jeff bezos says he is selling $1 billion a year in amazon stock to finance race to space', 'us government commits to publish publicly financed software under free software licenses', 'the dream life is having your luggage first out of the carousel each time.', 'stocks sink as apple, facebook pace the tech wreck: markets wrap', \"elon musk's spacex cuts loan deal by $500 million\", 'nvidia stock falls another 12%', 'anything is possible in this world! exhibit a: creation of a sequel to superbabies.', 'elon musk forced to step down as chairman, tsla short all the way!']\n"
     ]
    }
   ],
   "source": [
    "# your code goes here\n",
    "\n",
    "# Lowercase every tweet; a list comprehension avoids tracking an index by hand\n",
    "all_tweets = [tweet.lower() for tweet in all_tweets]\n",
    "\n",
    "print(all_tweets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Part 2\n",
    "\n",
    "Here, we use Python's regular expression module, `re`, to remove punctuation characters.\n",
    "\n",
    "The easiest way to remove specific punctuation characters is with a regular expression. You can replace every match of a pattern with a space:\n",
    "\n",
    "```python\n",
    "re.sub(pattern, ' ', text)\n",
    "```\n",
    "\n",
    "This substitutes a space wherever the pattern matches in the text.\n",
    "\n",
    "The pattern we use is `[^a-zA-Z0-9]`. It matches every character that is not an ASCII letter or digit, which includes all punctuation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['the long term stock exchange is worth a shot', 'predicting stock performance with natural language deep learning', 'comcast acquiring time warner cable in all stock deal worth  45 2 billion', 'facebook stock drops more than 20  after revenue forecast misses', 'facebook buying whatsapp for  16b in cash and stock plus  3b in rsus', 'netflix s  death cross  is the third for faang stocks and nasdaq composite is next', 'after yesterday s signs of recovery  crypto markets see drastic losses', 'mf sees australia risks tilt to downside on china  trade war', 'bitcoin cash clash is costing billions with no end in sight', 'sec crypto settlements spur expectations of wider ico crackdown', 'nissan s drama looks a lot like a palace coup', 'yahoo finance has apparently killed its api', 'tesla tanks after goldman downgrades to sell', 'goldman sachs to open a bitcoin trading operation', 'tax free bitcoin to ether trading in us to end under gop plan', 'goldman sachs is setting up a cryptocurrency trading desk', 'robinhood stock trading app confirms  110m raise at  1 3b valuation', 'how i made  500k with machine learning and high frequency trading', 'tesla s finance team is losing another top executive', 'finance sites erroneously show amazon  apple  other stocks crashing', 'jeff bezos says he is selling  1 billion a year in amazon stock to finance race to space', 'us government commits to publish publicly financed software under free software licenses', 'the dream life is having your luggage first out of the carousel each time ', 'stocks sink as apple  facebook pace the tech wreck  markets wrap', 'elon musk s spacex cuts loan deal by  500 million', 'nvidia stock falls another 12 ', 'anything is possible in this world  exhibit a  creation of a sequel to superbabies ', 'elon musk forced to step down as chairman  tsla short all the way ']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "# Replace every character that is not an ASCII letter or digit with a space\n",
    "all_tweets = [re.sub(r'[^a-zA-Z0-9]', ' ', tweet) for tweet in all_tweets]\n",
    "\n",
    "print(all_tweets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### NLTK: Natural Language ToolKit\n",
    "\n",
    "NLTK is a leading platform for building Python programs to work with human language data. It has a suite of tools for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. \n",
    "\n",
    "Let's import NLTK. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os \n",
    "import nltk \n",
    "nltk.data.path.append(os.path.join(os.getcwd(), \"nltk_data\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TODO: Part 1\n",
    "\n",
    "NLTK provides the `TweetTokenizer` class, which splits tweets into tokens.\n",
    "\n",
    "This makes tokenizing tweets much easier and faster.\n",
    "\n",
    "You can pass the argument `preserve_case=False` to `TweetTokenizer` to make your tokens lowercase. In the cell below, tokenize each tweet in `all_tweets`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['the', 'long', 'term', 'stock', 'exchange', 'is', 'worth', 'a', 'shot']\n",
      "['predicting', 'stock', 'performance', 'with', 'natural', 'language', 'deep', 'learning']\n",
      "['comcast', 'acquiring', 'time', 'warner', 'cable', 'in', 'all', 'stock', 'deal', 'worth', '45', '2', 'billion']\n",
      "['facebook', 'stock', 'drops', 'more', 'than', '20', 'after', 'revenue', 'forecast', 'misses']\n",
      "['facebook', 'buying', 'whatsapp', 'for', '16b', 'in', 'cash', 'and', 'stock', 'plus', '3b', 'in', 'rsus']\n",
      "['netflix', 's', 'death', 'cross', 'is', 'the', 'third', 'for', 'faang', 'stocks', 'and', 'nasdaq', 'composite', 'is', 'next']\n",
      "['after', 'yesterday', 's', 'signs', 'of', 'recovery', 'crypto', 'markets', 'see', 'drastic', 'losses']\n",
      "['mf', 'sees', 'australia', 'risks', 'tilt', 'to', 'downside', 'on', 'china', 'trade', 'war']\n",
      "['bitcoin', 'cash', 'clash', 'is', 'costing', 'billions', 'with', 'no', 'end', 'in', 'sight']\n",
      "['sec', 'crypto', 'settlements', 'spur', 'expectations', 'of', 'wider', 'ico', 'crackdown']\n",
      "['nissan', 's', 'drama', 'looks', 'a', 'lot', 'like', 'a', 'palace', 'coup']\n",
      "['yahoo', 'finance', 'has', 'apparently', 'killed', 'its', 'api']\n",
      "['tesla', 'tanks', 'after', 'goldman', 'downgrades', 'to', 'sell']\n",
      "['goldman', 'sachs', 'to', 'open', 'a', 'bitcoin', 'trading', 'operation']\n",
      "['tax', 'free', 'bitcoin', 'to', 'ether', 'trading', 'in', 'us', 'to', 'end', 'under', 'gop', 'plan']\n",
      "['goldman', 'sachs', 'is', 'setting', 'up', 'a', 'cryptocurrency', 'trading', 'desk']\n",
      "['robinhood', 'stock', 'trading', 'app', 'confirms', '110m', 'raise', 'at', '1', '3b', 'valuation']\n",
      "['how', 'i', 'made', '500k', 'with', 'machine', 'learning', 'and', 'high', 'frequency', 'trading']\n",
      "['tesla', 's', 'finance', 'team', 'is', 'losing', 'another', 'top', 'executive']\n",
      "['finance', 'sites', 'erroneously', 'show', 'amazon', 'apple', 'other', 'stocks', 'crashing']\n",
      "['jeff', 'bezos', 'says', 'he', 'is', 'selling', '1', 'billion', 'a', 'year', 'in', 'amazon', 'stock', 'to', 'finance', 'race', 'to', 'space']\n",
      "['us', 'government', 'commits', 'to', 'publish', 'publicly', 'financed', 'software', 'under', 'free', 'software', 'licenses']\n",
      "['the', 'dream', 'life', 'is', 'having', 'your', 'luggage', 'first', 'out', 'of', 'the', 'carousel', 'each', 'time']\n",
      "['stocks', 'sink', 'as', 'apple', 'facebook', 'pace', 'the', 'tech', 'wreck', 'markets', 'wrap']\n",
      "['elon', 'musk', 's', 'spacex', 'cuts', 'loan', 'deal', 'by', '500', 'million']\n",
      "['nvidia', 'stock', 'falls', 'another', '12']\n",
      "['anything', 'is', 'possible', 'in', 'this', 'world', 'exhibit', 'a', 'creation', 'of', 'a', 'sequel', 'to', 'superbabies']\n",
      "['elon', 'musk', 'forced', 'to', 'step', 'down', 'as', 'chairman', 'tsla', 'short', 'all', 'the', 'way']\n"
     ]
    }
   ],
   "source": [
    "from nltk.tokenize import TweetTokenizer\n",
    "\n",
    "# your code goes here\n",
    "tknzr = TweetTokenizer(preserve_case=False)\n",
    "\n",
    "for tweet in all_tweets:\n",
    "    print(tknzr.tokenize(tweet))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Part 2\n",
    "\n",
    "NLTK also provides resources for filtering tokens after tokenization.\n",
    "\n",
    "For example, stop words are very common words that carry little meaning for text analysis, such as \"the\", \"and\", \"if\", etc. Ideally, we want to remove these words from our tokenized lists.\n",
    "\n",
    "NLTK has a list of these words in `nltk.corpus.stopwords`, which you first need to download with `nltk.download`.\n",
    "\n",
    "Let's print out the English stop words to see what they are."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to\n",
      "[nltk_data]     /Users/jdelgado/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from nltk.corpus import stopwords\n",
    "nltk.download(\"stopwords\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO:\n",
    "\n",
    "Print the English stop words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', \"you're\", \"you've\", \"you'll\", \"you'd\", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', \"she's\", 'her', 'hers', 'herself', 'it', \"it's\", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', \"that'll\", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', \"don't\", 'should', \"should've\", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', \"aren't\", 'couldn', \"couldn't\", 'didn', \"didn't\", 'doesn', \"doesn't\", 'hadn', \"hadn't\", 'hasn', \"hasn't\", 'haven', \"haven't\", 'isn', \"isn't\", 'ma', 'mightn', \"mightn't\", 'mustn', \"mustn't\", 'needn', \"needn't\", 'shan', \"shan't\", 'shouldn', \"shouldn't\", 'wasn', \"wasn't\", 'weren', \"weren't\", 'won', \"won't\", 'wouldn', \"wouldn't\"]\n"
     ]
    }
   ],
   "source": [
    "# your code is here\n",
    "print(stopwords.words('english'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TODO: Part 3\n",
    "\n",
    "In the cell below, use the `.split()` method to split each tweet into a list of words, then remove the stop words from every tweet."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['long', 'term', 'stock', 'exchange', 'worth', 'shot']\n",
      "['predicting', 'stock', 'performance', 'natural', 'language', 'deep', 'learning']\n",
      "['comcast', 'acquiring', 'time', 'warner', 'cable', 'stock', 'deal', 'worth', '45', '2', 'billion']\n",
      "['facebook', 'stock', 'drops', '20', 'revenue', 'forecast', 'misses']\n",
      "['facebook', 'buying', 'whatsapp', '16b', 'cash', 'stock', 'plus', '3b', 'rsus']\n",
      "['netflix', 'death', 'cross', 'third', 'faang', 'stocks', 'nasdaq', 'composite', 'next']\n",
      "['yesterday', 'signs', 'recovery', 'crypto', 'markets', 'see', 'drastic', 'losses']\n",
      "['mf', 'sees', 'australia', 'risks', 'tilt', 'downside', 'china', 'trade', 'war']\n",
      "['bitcoin', 'cash', 'clash', 'costing', 'billions', 'end', 'sight']\n",
      "['sec', 'crypto', 'settlements', 'spur', 'expectations', 'wider', 'ico', 'crackdown']\n",
      "['nissan', 'drama', 'looks', 'lot', 'like', 'palace', 'coup']\n",
      "['yahoo', 'finance', 'apparently', 'killed', 'api']\n",
      "['tesla', 'tanks', 'goldman', 'downgrades', 'sell']\n",
      "['goldman', 'sachs', 'open', 'bitcoin', 'trading', 'operation']\n",
      "['tax', 'free', 'bitcoin', 'ether', 'trading', 'us', 'end', 'gop', 'plan']\n",
      "['goldman', 'sachs', 'setting', 'cryptocurrency', 'trading', 'desk']\n",
      "['robinhood', 'stock', 'trading', 'app', 'confirms', '110m', 'raise', '1', '3b', 'valuation']\n",
      "['made', '500k', 'machine', 'learning', 'high', 'frequency', 'trading']\n",
      "['tesla', 'finance', 'team', 'losing', 'another', 'top', 'executive']\n",
      "['finance', 'sites', 'erroneously', 'show', 'amazon', 'apple', 'stocks', 'crashing']\n",
      "['jeff', 'bezos', 'says', 'selling', '1', 'billion', 'year', 'amazon', 'stock', 'finance', 'race', 'space']\n",
      "['us', 'government', 'commits', 'publish', 'publicly', 'financed', 'software', 'free', 'software', 'licenses']\n",
      "['dream', 'life', 'luggage', 'first', 'carousel', 'time']\n",
      "['stocks', 'sink', 'apple', 'facebook', 'pace', 'tech', 'wreck', 'markets', 'wrap']\n",
      "['elon', 'musk', 'spacex', 'cuts', 'loan', 'deal', '500', 'million']\n",
      "['nvidia', 'stock', 'falls', 'another', '12']\n",
      "['anything', 'possible', 'world', 'exhibit', 'creation', 'sequel', 'superbabies']\n",
      "['elon', 'musk', 'forced', 'step', 'chairman', 'tsla', 'short', 'way']\n"
     ]
    }
   ],
   "source": [
    "## your code is here\n",
    "\n",
    "# Build the stop-word set once; set membership tests are fast\n",
    "stop_words = set(stopwords.words(\"english\"))\n",
    "\n",
    "for tweet in all_tweets:\n",
    "    words = tweet.split()\n",
    "\n",
    "    print([w for w in words if w not in stop_words])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Stemming\n",
    "Stemming is the process of reducing words to their word stem, base or root form.\n",
    "\n",
    "In the cell below, use the `PorterStemmer` class from the NLTK library to stem all the tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['long', 'term', 'stock', 'exchang', 'worth', 'shot']\n",
      "['predict', 'stock', 'perform', 'natur', 'languag', 'deep', 'learn']\n",
      "['comcast', 'acquir', 'time', 'warner', 'cabl', 'stock', 'deal', 'worth', '45', '2', 'billion']\n",
      "['facebook', 'stock', 'drop', '20', 'revenu', 'forecast', 'miss']\n",
      "['facebook', 'buy', 'whatsapp', '16b', 'cash', 'stock', 'plu', '3b', 'rsu']\n",
      "['netflix', 'death', 'cross', 'third', 'faang', 'stock', 'nasdaq', 'composit', 'next']\n",
      "['yesterday', 'sign', 'recoveri', 'crypto', 'market', 'see', 'drastic', 'loss']\n",
      "['mf', 'see', 'australia', 'risk', 'tilt', 'downsid', 'china', 'trade', 'war']\n",
      "['bitcoin', 'cash', 'clash', 'cost', 'billion', 'end', 'sight']\n",
      "['sec', 'crypto', 'settlement', 'spur', 'expect', 'wider', 'ico', 'crackdown']\n",
      "['nissan', 'drama', 'look', 'lot', 'like', 'palac', 'coup']\n",
      "['yahoo', 'financ', 'appar', 'kill', 'api']\n",
      "['tesla', 'tank', 'goldman', 'downgrad', 'sell']\n",
      "['goldman', 'sach', 'open', 'bitcoin', 'trade', 'oper']\n",
      "['tax', 'free', 'bitcoin', 'ether', 'trade', 'us', 'end', 'gop', 'plan']\n",
      "['goldman', 'sach', 'set', 'cryptocurr', 'trade', 'desk']\n",
      "['robinhood', 'stock', 'trade', 'app', 'confirm', '110m', 'rais', '1', '3b', 'valuat']\n",
      "['made', '500k', 'machin', 'learn', 'high', 'frequenc', 'trade']\n",
      "['tesla', 'financ', 'team', 'lose', 'anoth', 'top', 'execut']\n",
      "['financ', 'site', 'erron', 'show', 'amazon', 'appl', 'stock', 'crash']\n",
      "['jeff', 'bezo', 'say', 'sell', '1', 'billion', 'year', 'amazon', 'stock', 'financ', 'race', 'space']\n",
      "['us', 'govern', 'commit', 'publish', 'publicli', 'financ', 'softwar', 'free', 'softwar', 'licens']\n",
      "['dream', 'life', 'luggag', 'first', 'carousel', 'time']\n",
      "['stock', 'sink', 'appl', 'facebook', 'pace', 'tech', 'wreck', 'market', 'wrap']\n",
      "['elon', 'musk', 'spacex', 'cut', 'loan', 'deal', '500', 'million']\n",
      "['nvidia', 'stock', 'fall', 'anoth', '12']\n",
      "['anyth', 'possibl', 'world', 'exhibit', 'creation', 'sequel', 'superbabi']\n",
      "['elon', 'musk', 'forc', 'step', 'chairman', 'tsla', 'short', 'way']\n"
     ]
    }
   ],
   "source": [
    "from nltk.stem.porter import PorterStemmer\n",
    "\n",
    "# your code goes here\n",
    "\n",
    "# Create the stemmer and the stop-word set once, outside the loop\n",
    "stemmer = PorterStemmer()\n",
    "stop_words = set(stopwords.words(\"english\"))\n",
    "\n",
    "for tweet in all_tweets:\n",
    "    words = tweet.split()\n",
    "\n",
    "    new_words = [w for w in words if w not in stop_words]\n",
    "\n",
    "    print([stemmer.stem(w) for w in new_words])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lemmatizing\n",
    "#### Part 1\n",
    "\n",
    "Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item.\n",
    "\n",
    "To reduce words to their dictionary form (lemma), you can use the `WordNetLemmatizer` class.\n",
    "\n",
    "For more information about lemmatizing in NLTK, see the NLTK documentation: https://www.nltk.org/api/nltk.stem.html\n",
    "\n",
    "To learn more about stemming and lemmatizing, take a look at the following source:\n",
    "https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package wordnet to\n",
      "[nltk_data]     /Users/jdelgado/nltk_data...\n",
      "[nltk_data]   Package wordnet is already up-to-date!\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nltk.download('wordnet')  # download the WordNet corpus used by the lemmatizer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### TODO:\n",
    "\n",
    "In the cell below, use the `WordNetLemmatizer` class to lemmatize all the tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['long', 'term', 'stock', 'exchange', 'worth', 'shot']\n",
      "['predicting', 'stock', 'performance', 'natural', 'language', 'deep', 'learning']\n",
      "['comcast', 'acquiring', 'time', 'warner', 'cable', 'stock', 'deal', 'worth', '45', '2', 'billion']\n",
      "['facebook', 'stock', 'drop', '20', 'revenue', 'forecast', 'miss']\n",
      "['facebook', 'buying', 'whatsapp', '16b', 'cash', 'stock', 'plus', '3b', 'rsus']\n",
      "['netflix', 'death', 'cross', 'third', 'faang', 'stock', 'nasdaq', 'composite', 'next']\n",
      "['yesterday', 'sign', 'recovery', 'crypto', 'market', 'see', 'drastic', 'loss']\n",
      "['mf', 'see', 'australia', 'risk', 'tilt', 'downside', 'china', 'trade', 'war']\n",
      "['bitcoin', 'cash', 'clash', 'costing', 'billion', 'end', 'sight']\n",
      "['sec', 'crypto', 'settlement', 'spur', 'expectation', 'wider', 'ico', 'crackdown']\n",
      "['nissan', 'drama', 'look', 'lot', 'like', 'palace', 'coup']\n",
      "['yahoo', 'finance', 'apparently', 'killed', 'api']\n",
      "['tesla', 'tank', 'goldman', 'downgrade', 'sell']\n",
      "['goldman', 'sachs', 'open', 'bitcoin', 'trading', 'operation']\n",
      "['tax', 'free', 'bitcoin', 'ether', 'trading', 'u', 'end', 'gop', 'plan']\n",
      "['goldman', 'sachs', 'setting', 'cryptocurrency', 'trading', 'desk']\n",
      "['robinhood', 'stock', 'trading', 'app', 'confirms', '110m', 'raise', '1', '3b', 'valuation']\n",
      "['made', '500k', 'machine', 'learning', 'high', 'frequency', 'trading']\n",
      "['tesla', 'finance', 'team', 'losing', 'another', 'top', 'executive']\n",
      "['finance', 'site', 'erroneously', 'show', 'amazon', 'apple', 'stock', 'crashing']\n",
      "['jeff', 'bezos', 'say', 'selling', '1', 'billion', 'year', 'amazon', 'stock', 'finance', 'race', 'space']\n",
      "['u', 'government', 'commits', 'publish', 'publicly', 'financed', 'software', 'free', 'software', 'license']\n",
      "['dream', 'life', 'luggage', 'first', 'carousel', 'time']\n",
      "['stock', 'sink', 'apple', 'facebook', 'pace', 'tech', 'wreck', 'market', 'wrap']\n",
      "['elon', 'musk', 'spacex', 'cut', 'loan', 'deal', '500', 'million']\n",
      "['nvidia', 'stock', 'fall', 'another', '12']\n",
      "['anything', 'possible', 'world', 'exhibit', 'creation', 'sequel', 'superbabies']\n",
      "['elon', 'musk', 'forced', 'step', 'chairman', 'tsla', 'short', 'way']\n"
     ]
    }
   ],
   "source": [
    "from nltk.stem.wordnet import WordNetLemmatizer\n",
    "\n",
    "# your code goes here\n",
    "\n",
    "# Create the lemmatizer and the stop-word set once, outside the loop\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "stop_words = set(stopwords.words(\"english\"))\n",
    "\n",
    "for tweet in all_tweets:\n",
    "    words = tweet.split()\n",
    "\n",
    "    new_words = [w for w in words if w not in stop_words]\n",
    "\n",
    "    print([lemmatizer.lemmatize(w) for w in new_words])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TODO: Part 2\n",
    "\n",
    "By default, `lemmatize` treats every word as a noun. In the cell below, lemmatize the words as verbs by passing `pos='v'` as an argument to `WordNetLemmatizer().lemmatize`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['long', 'term', 'stock', 'exchange', 'worth', 'shoot']\n",
      "['predict', 'stock', 'performance', 'natural', 'language', 'deep', 'learn']\n",
      "['comcast', 'acquire', 'time', 'warner', 'cable', 'stock', 'deal', 'worth', '45', '2', 'billion']\n",
      "['facebook', 'stock', 'drop', '20', 'revenue', 'forecast', 'miss']\n",
      "['facebook', 'buy', 'whatsapp', '16b', 'cash', 'stock', 'plus', '3b', 'rsus']\n",
      "['netflix', 'death', 'cross', 'third', 'faang', 'stock', 'nasdaq', 'composite', 'next']\n",
      "['yesterday', 'sign', 'recovery', 'crypto', 'market', 'see', 'drastic', 'losses']\n",
      "['mf', 'see', 'australia', 'risk', 'tilt', 'downside', 'china', 'trade', 'war']\n",
      "['bitcoin', 'cash', 'clash', 'cost', 'billions', 'end', 'sight']\n",
      "['sec', 'crypto', 'settlements', 'spur', 'expectations', 'wider', 'ico', 'crackdown']\n",
      "['nissan', 'drama', 'look', 'lot', 'like', 'palace', 'coup']\n",
      "['yahoo', 'finance', 'apparently', 'kill', 'api']\n",
      "['tesla', 'tank', 'goldman', 'downgrade', 'sell']\n",
      "['goldman', 'sachs', 'open', 'bitcoin', 'trade', 'operation']\n",
      "['tax', 'free', 'bitcoin', 'ether', 'trade', 'us', 'end', 'gop', 'plan']\n",
      "['goldman', 'sachs', 'set', 'cryptocurrency', 'trade', 'desk']\n",
      "['robinhood', 'stock', 'trade', 'app', 'confirm', '110m', 'raise', '1', '3b', 'valuation']\n",
      "['make', '500k', 'machine', 'learn', 'high', 'frequency', 'trade']\n",
      "['tesla', 'finance', 'team', 'lose', 'another', 'top', 'executive']\n",
      "['finance', 'sit', 'erroneously', 'show', 'amazon', 'apple', 'stock', 'crash']\n",
      "['jeff', 'bezos', 'say', 'sell', '1', 'billion', 'year', 'amazon', 'stock', 'finance', 'race', 'space']\n",
      "['us', 'government', 'commit', 'publish', 'publicly', 'finance', 'software', 'free', 'software', 'license']\n",
      "['dream', 'life', 'luggage', 'first', 'carousel', 'time']\n",
      "['stock', 'sink', 'apple', 'facebook', 'pace', 'tech', 'wreck', 'market', 'wrap']\n",
      "['elon', 'musk', 'spacex', 'cut', 'loan', 'deal', '500', 'million']\n",
      "['nvidia', 'stock', 'fall', 'another', '12']\n",
      "['anything', 'possible', 'world', 'exhibit', 'creation', 'sequel', 'superbabies']\n",
      "['elon', 'musk', 'force', 'step', 'chairman', 'tsla', 'short', 'way']\n"
     ]
    }
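   ],
   "source": [
    "from nltk.stem.wordnet import WordNetLemmatizer\n",
    "\n",
    "# your code goes here\n",
    "\n",
    "# Create the lemmatizer and the stop-word set once, outside the loop\n",
    "lemmatizer = WordNetLemmatizer()\n",
    "stop_words = set(stopwords.words(\"english\"))\n",
    "\n",
    "for tweet in all_tweets:\n",
    "    words = tweet.split()\n",
    "\n",
    "    new_words = [w for w in words if w not in stop_words]\n",
    "\n",
    "    # pos='v' tells the lemmatizer to treat each word as a verb\n",
    "    print([lemmatizer.lemmatize(w, pos='v') for w in new_words])"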
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
